FIFA 19 Data Exploration

[Image: fifa_19_image.jpg]

Importing Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# loading dataset
df = pd.read_excel('data.xlsx')
df.head()
Out[1]:
Unnamed: 0 ID Name Age Photo Nationality Flag Overall Potential Club ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 0 158023 L. Messi 31 https://cdn.sofifa.org/players/4/19/158023.png Argentina https://cdn.sofifa.org/flags/52.png 94 94 FC Barcelona ... 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 1 20801 Cristiano Ronaldo 33 https://cdn.sofifa.org/players/4/19/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Juventus ... 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M
2 2 190871 Neymar Jr 26 https://cdn.sofifa.org/players/4/19/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 93 Paris Saint-Germain ... 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 €228.1M
3 3 193080 De Gea 27 https://cdn.sofifa.org/players/4/19/193080.png Spain https://cdn.sofifa.org/flags/45.png 91 93 Manchester United ... 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 €138.6M
4 4 192985 K. De Bruyne 27 https://cdn.sofifa.org/players/4/19/192985.png Belgium https://cdn.sofifa.org/flags/7.png 91 92 Manchester City ... 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 €196.4M

5 rows × 89 columns

We will not use all 89 features in this analysis, so we keep only the ones we need.

In [2]:
df_copy = df.copy()
In [3]:
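# keep only the columns needed; .copy() avoids SettingWithCopyWarning in the cleaning steps below
df = df[['Age','Nationality', 'Club','Release Clause','Wage','Value','Preferred Foot','Position','Weight','Finishing',
   'Dribbling','BallControl','Stamina','Jumping','SlidingTackle','GKReflexes','Body Type']].copy()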
In [4]:
df
Out[4]:
Age Nationality Club Release Clause Wage Value Preferred Foot Position Weight Finishing Dribbling BallControl Stamina Jumping SlidingTackle GKReflexes Body Type
0 31 Argentina FC Barcelona €226.5M €565K €110.5M Left RF 159lbs 95.0 97.0 96.0 72.0 68.0 26.0 8.0 Stocky
1 33 Portugal Juventus €127.1M €405K €77M Right ST 183lbs 94.0 88.0 94.0 88.0 95.0 23.0 11.0 Normal
2 26 Brazil Paris Saint-Germain €228.1M €290K €118.5M Right LW 150lbs 87.0 96.0 95.0 81.0 61.0 33.0 11.0 Normal
3 27 Spain Manchester United €138.6M €260K €72M Right GK 168lbs 13.0 18.0 42.0 43.0 67.0 13.0 94.0 Lean
4 27 Belgium Manchester City €196.4M €355K €102M Right RCM 154lbs 82.0 86.0 91.0 90.0 63.0 51.0 13.0 Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18202 19 England Crewe Alexandra €143K €1K €60K Right CM 134lbs 38.0 42.0 43.0 40.0 55.0 47.0 9.0 Lean
18203 19 Sweden Trelleborgs FF €113K €1K €60K Right ST 170lbs 52.0 39.0 40.0 43.0 47.0 19.0 12.0 Normal
18204 16 England Cambridge United €165K €1K €60K Right ST 148lbs 40.0 45.0 44.0 55.0 60.0 11.0 13.0 Normal
18205 17 England Tranmere Rovers €143K €1K €60K Right RW 154lbs 50.0 51.0 52.0 40.0 42.0 27.0 9.0 Lean
18206 16 England Tranmere Rovers €165K €1K €60K Right CM 176lbs 34.0 43.0 51.0 47.0 62.0 50.0 9.0 Lean

18207 rows × 17 columns

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             18207 non-null  int64  
 1   Nationality     18207 non-null  object 
 2   Club            17966 non-null  object 
 3   Release Clause  16643 non-null  object 
 4   Wage            18207 non-null  object 
 5   Value           18207 non-null  object 
 6   Preferred Foot  18159 non-null  object 
 7   Position        18147 non-null  object 
 8   Weight          18159 non-null  object 
 9   Finishing       18159 non-null  float64
 10  Dribbling       18159 non-null  float64
 11  BallControl     18159 non-null  float64
 12  Stamina         18159 non-null  float64
 13  Jumping         18159 non-null  float64
 14  SlidingTackle   18159 non-null  float64
 15  GKReflexes      18159 non-null  float64
 16  Body Type       18207 non-null  object 
dtypes: float64(7), int64(1), object(9)
memory usage: 2.4+ MB

Data cleaning

In [6]:
# checking for missing values
df.isnull().sum()
Out[6]:
Age                  0
Nationality          0
Club               241
Release Clause    1564
Wage                 0
Value                0
Preferred Foot      48
Position            60
Weight              48
Finishing           48
Dribbling           48
BallControl         48
Stamina             48
Jumping             48
SlidingTackle       48
GKReflexes          48
Body Type            0
dtype: int64
In [7]:
# removing rows with missing values from our dataset
df.dropna(inplace=True)
In [8]:
df.isnull().sum()
Out[8]:
Age               0
Nationality       0
Club              0
Release Clause    0
Wage              0
Value             0
Preferred Foot    0
Position          0
Weight            0
Finishing         0
Dribbling         0
BallControl       0
Stamina           0
Jumping           0
SlidingTackle     0
GKReflexes        0
Body Type         0
dtype: int64
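
Dropping every row with any missing value reduces the data from 18207 to 16643 rows, and most of that loss comes from the Release Clause column (1564 missing). The sketch below, on a made-up toy frame rather than the real data, shows how one could check the share of missing values per column and compare a full dropna with one restricted to a subset of columns. The toy values and the choice of 'Finishing' as the subset are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with gaps in different columns.
toy = pd.DataFrame({
    'Club': ['FC Barcelona', 'Juventus', None, 'Crewe Alexandra'],
    'Release Clause': ['€226.5M', None, None, '€143K'],
    'Finishing': [95.0, 94.0, np.nan, 38.0],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()
print(missing_share)

# Drop rows with a gap in any column (what the notebook does).
all_dropped = toy.dropna()

# Drop rows only when a column needed for a given plot is missing.
subset_dropped = toy.dropna(subset=['Finishing'])

print(len(toy), len(all_dropped), len(subset_dropped))
```

Whether the subset variant is preferable depends on which columns a given plot actually uses; the notebook keeps the simpler full drop.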
In [9]:
# remove lbs in weight column

df['Weight'] = df['Weight'].str.replace('lbs','')
In [10]:
# change weight column from object to int
df['Weight'] = df['Weight'].astype(int)
In [11]:
# checking that the changes have been applied

df['Weight']
Out[11]:
0        159
1        183
2        150
3        168
4        154
        ... 
18202    134
18203    170
18204    148
18205    154
18206    176
Name: Weight, Length: 16643, dtype: int32

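From observation, the Value, Wage and Release Clause columns mark thousands with 'K' and millions with 'M'. To convert them to integers we remove the decimal point and then replace K with '000' and M with '00000'. M gets five zeros rather than six because the digit after the decimal point has already been merged into the number (€110.5M becomes 1105 followed by 00000). Note that this rule is only correct for values with exactly one decimal digit: a whole-number value such as €77M comes out as 7700000 instead of 77000000.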
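
The suffix conversion described above can also be done numerically instead of by string replacement, which handles values with and without a decimal part in the same way. This is a minimal sketch, not the method used in the cells that follow. The helper name parse_euro and the sample Series are hypothetical, and the sample strings are taken from the data preview earlier in the notebook.

```python
import pandas as pd

# Multipliers for the suffixes seen in the Value, Wage and Release Clause columns.
SUFFIX_MULTIPLIER = {'K': 1_000, 'M': 1_000_000}

def parse_euro(text):
    """Convert a string such as '€110.5M' or '€565K' to an integer number of euros.

    parse_euro is a hypothetical helper, not part of the original notebook.
    """
    number = text.replace('€', '').strip()
    multiplier = 1
    if number and number[-1] in SUFFIX_MULTIPLIER:
        multiplier = SUFFIX_MULTIPLIER[number[-1]]
        number = number[:-1]
    # round() guards against small floating-point errors before the int cast
    return int(round(float(number) * multiplier))

# Sample strings as they appear in the data preview.
sample = pd.Series(['€110.5M', '€77M', '€565K', '€60K'])
converted = sample.map(parse_euro)
print(converted.tolist())
```

On the real data this would be applied per column, for example df['Value'].map(parse_euro), assuming every entry follows one of these formats.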

In [12]:
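# removing decimal point (regex=False so that '.' is treated as a literal dot, not "any character")

df['Value'] = df['Value'].str.replace('.', '', regex=False)
df['Release Clause'] = df['Release Clause'].str.replace('.', '', regex=False)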
In [13]:
# replace K with '000' and M with '00000' (five zeros, since the decimal digit was merged in above)

df['Value'] = df['Value'].str.replace('K','000')
df['Value'] = df['Value'].str.replace('M', '00000')

df['Release Clause'] = df['Release Clause'].str.replace('K','000')
df['Release Clause'] = df['Release Clause'].str.replace('M', '00000')

df['Wage'] = df['Wage'].str.replace('K','000')
In [14]:
df['Value']
Out[14]:
0        €110500000
1          €7700000
2        €118500000
3          €7200000
4         €10200000
            ...    
18202        €60000
18203        €60000
18204        €60000
18205        €60000
18206        €60000
Name: Value, Length: 16643, dtype: object
In [15]:
# remove the currency symbol '€' from the dataset

df['Value'] = df['Value'].str.replace('€','')
df['Release Clause'] = df['Release Clause'].str.replace('€','')
df['Wage'] = df['Wage'].str.replace('€','')
In [16]:
df.head(3)
Out[16]:
Age Nationality Club Release Clause Wage Value Preferred Foot Position Weight Finishing Dribbling BallControl Stamina Jumping SlidingTackle GKReflexes Body Type
0 31 Argentina FC Barcelona 226500000 565000 110500000 Left RF 159 95.0 97.0 96.0 72.0 68.0 26.0 8.0 Stocky
1 33 Portugal Juventus 127100000 405000 7700000 Right ST 183 94.0 88.0 94.0 88.0 95.0 23.0 11.0 Normal
2 26 Brazil Paris Saint-Germain 228100000 290000 118500000 Right LW 150 87.0 96.0 95.0 81.0 61.0 33.0 11.0 Normal
In [17]:
# change data types
df['Value'] = df['Value'].astype(int)
df['Wage'] = df['Wage'].astype(int)
df['Release Clause'] = df['Release Clause'].astype(int)
In [18]:
df.describe(include = 'all')
Out[18]:
Age Nationality Club Release Clause Wage Value Preferred Foot Position Weight Finishing Dribbling BallControl Stamina Jumping SlidingTackle GKReflexes Body Type
count 16643.000000 16643 16643 1.664300e+04 16643.000000 1.664300e+04 16643 16643 16643.000000 16643.000000 16643.000000 16643.000000 16643.000000 16643.000000 16643.000000 16643.000000 16643
unique NaN 161 651 NaN NaN NaN 2 27 NaN NaN NaN NaN NaN NaN NaN NaN 3
top NaN England Manchester City NaN NaN NaN Right ST NaN NaN NaN NaN NaN NaN NaN NaN Normal
freq NaN 1475 33 NaN NaN NaN 12823 1924 NaN NaN NaN NaN NaN NaN NaN NaN 9733
mean 25.226221 NaN NaN 4.161694e+06 9618.037613 1.611506e+06 NaN NaN 165.987202 45.257766 55.104729 58.136274 63.160007 65.120591 45.751607 16.837409 NaN
std 4.716588 NaN NaN 1.067936e+07 22263.518927 3.991862e+06 NaN NaN 15.575312 19.538677 19.008604 16.785044 16.064355 11.856488 21.295201 18.090985 NaN
min 16.000000 NaN NaN 1.300000e+04 1000.000000 1.000000e+04 NaN NaN 110.000000 2.000000 4.000000 5.000000 12.000000 15.000000 3.000000 1.000000 NaN
25% 21.000000 NaN NaN 4.545000e+05 1000.000000 2.800000e+05 NaN NaN 154.000000 30.000000 48.000000 54.000000 56.000000 58.000000 24.000000 8.000000 NaN
50% 25.000000 NaN NaN 1.000000e+06 3000.000000 6.000000e+05 NaN NaN 165.000000 48.000000 61.000000 63.000000 66.000000 66.000000 52.000000 11.000000 NaN
75% 29.000000 NaN NaN 2.800000e+06 8000.000000 1.300000e+06 NaN NaN 176.000000 61.000000 68.000000 69.000000 74.000000 73.000000 64.000000 14.000000 NaN
max 45.000000 NaN NaN 2.281000e+08 565000.000000 1.185000e+08 NaN NaN 243.000000 95.000000 97.000000 96.000000 96.000000 95.000000 91.000000 94.000000 NaN
In [34]:
df.to_csv('fifa_19_cleaned.csv', index=False)

Observation

  • The youngest player is 16 years old, and 75% of players are aged 29 or younger.
  • The heaviest player weighs 243 lbs.
  • The median stamina rating is 66: half of the players are rated 66 or lower.

Univariate Exploration

What is the Age distribution of players?

In [19]:
df['Age'].hist();
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age distribution of players');
  • The distribution is right-skewed, and the majority of players are in their 20s.
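
The visual impression of a right skew can be backed with a number. The sketch below uses a small made-up list of ages (not the real column) to show how the sample skewness and the share of players in their 20s could be computed with pandas. A positive skewness indicates a longer right tail.

```python
import pandas as pd

# Toy ages, made up for illustration: many players in their 20s, a few older ones.
ages = pd.Series([19, 20, 21, 22, 22, 23, 24, 25, 26, 28, 31, 34, 38, 45])

# Sample skewness: values above 0 mean the right tail is longer.
skewness = ages.skew()

# Fraction of players aged 20 to 29 inclusive.
share_in_20s = ages.between(20, 29).mean()

print(f'skewness={skewness:.2f}, share in 20s={share_in_20s:.0%}')
```

On the notebook's data the same two lines would be applied to df['Age'].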

What is the weight distribution of players?

In [20]:
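# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(df['Weight'], kde=True);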
plt.ylabel('Frequency')
plt.title('Weight distributions of players');
  • The weight distribution of players looks approximately normal (bell-shaped).

Which 10 clubs have the most players?

In [21]:
def n_bar_plot(dataFrame, col, a, b, title, x_label, y_label):
    # bar chart of the value counts of `col`, from rank a up to (not including) rank b
    dataFrame[col].value_counts().iloc[a:b].plot(kind='bar')
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label);


n_bar_plot(df, 'Club', 0, 10, 'Most populated clubs', 'Club', 'count')
  • We can see that 6 of the 10 clubs are from England, 3 are Spanish and 1 is French.

With the findings above, let's take a look at players' nationalities.

Which 10 nationalities have the most players?

In [22]:
n_bar_plot(df, 'Nationality', 0, 10, 'Most Players Nationality', 'Nationality', 'Frequency')
  • England is the most common nationality, which may partly explain why most of the clubs with the largest squads are English.

Let's look at the preferred foot of players

In [23]:
def bar_plot(dataFrame, col, title, x_label, y_label):
    dataFrame[col].value_counts().plot(kind='bar');
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label);


bar_plot(df, 'Preferred Foot', 'Players Most Preferred Foot', 'Preferred Foot', 'count')
  • As we can see, most of the players are right-footed, which also matches the real world.

Players positioning

Having looked at players' preferred foot, let's look at player positions to see where most players are situated on the pitch.

In [24]:
plt.figure(figsize=(8,4))
bar_plot(df, 'Position', 'Players positioning', 'Position', 'count')
  • The most common position in FIFA 19 is striker (ST), followed by goalkeeper (GK).
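
"Most common" is not the same as "the majority": in the summary table above, ST appears 1924 times out of 16643 rows, which is roughly 12% of players. The sketch below shows, on a made-up list of positions, how value_counts(normalize=True) could be used to report shares instead of raw counts.

```python
import pandas as pd

# Toy positions, made up for illustration.
positions = pd.Series(['ST', 'ST', 'ST', 'GK', 'GK', 'CB', 'CM', 'LB', 'RB', 'CAM'])

# normalize=True returns each position's fraction of all rows, sorted descending.
shares = positions.value_counts(normalize=True)
print(shares.head(3))
```

On the notebook's data this would be df['Position'].value_counts(normalize=True).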

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

  • Most players in the FIFA 19 dataset are in their 20s.
  • The weights of players are approximately normally distributed.
  • Unsurprisingly, most of the players are right-footed.
  • I also looked at the 10 clubs with the most players and found six English clubs from the Premier League among them, which points to the weight of the Premier League in football.
  • Looking further at the dataset, I found that the most common position on the pitch is striker (ST).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Before making my visualizations, I removed rows with missing observations to avoid errors and support a sound analysis. I also converted some columns (Weight, Value, Wage and Release Clause) from strings to integers so that they could be used numerically.

Bivariate Exploration

Let's look at the correlation between Weight and Finishing

In [25]:
df.plot(x='Weight', y='Finishing', kind='scatter', title='Weight by finishing');
  • There seems to be no correlation between a player's weight and finishing, which makes sense: a player's weight does not necessarily mean he will be a good finisher.

Correlation between BallControl and Finishing

In [26]:
df.plot(y='BallControl', x='Finishing', kind='scatter', title='finishing and Ball control');
  • There is a positive correlation between BallControl and Finishing, as one would expect: to finish well, a player has to be able to control the ball well.

Correlation between Dribbling and Finishing

In [27]:
df.plot(y='Dribbling', x='Finishing', kind='scatter', title='correlation between Dribbling and Finishing');
  • As we can see, there is a strong positive correlation here, which is also what we would expect.
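
The scatter plots give a visual impression of the relationships. A correlation coefficient puts a number on them. The sketch below computes Pearson correlations with DataFrame.corr on six rows copied from the data previews earlier in the notebook. Six rows are far too few to be representative, so the numbers only illustrate the mechanics, not the result for the full dataset.

```python
import pandas as pd

# Six rows taken from the previews above (illustrative only, not representative).
toy = pd.DataFrame({
    'Weight':      [150, 183, 159, 168, 154, 176],
    'Finishing':   [87, 94, 95, 13, 82, 34],
    'BallControl': [95, 94, 96, 42, 91, 51],
    'Dribbling':   [96, 88, 97, 18, 86, 43],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr(method='pearson')

# Correlation of Finishing with each of the other three attributes.
print(corr.loc['Finishing', ['Weight', 'BallControl', 'Dribbling']].round(2))
```

On the notebook's data the equivalent call would be df[['Weight', 'Finishing', 'BallControl', 'Dribbling']].corr().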

Let's look at the value of players by body type

In [29]:
base_color = sns.color_palette()[0]
sns.violinplot(x='Body Type', y='Value', color = base_color, data=df);
plt.title('Value of player by Body type');
  • As we can see, players with a normal body type are valued more than players with the other body types.

Clubs with the highest total player wages

In [30]:
club_by_wage = df.groupby('Club')['Wage'].sum().sort_values(ascending=False).head(10)
club_by_wage.plot.bar(title='Top 10 club with most player wage')
plt.ylabel('Amount');
  • Real Madrid, a Spanish club, has the highest total wage bill.

Player value by preferred foot

In [32]:
sns.catplot(x='Preferred Foot',y='Value', kind='bar', data=df);
plt.title('Most Preferred foot of players by Value');
  • On average, left-footed players have a higher value than right-footed players.
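
The bar plot above shows the mean value per group, which is the default estimator for seaborn bar plots, and a mean can be pulled up by a handful of very expensive players. The sketch below uses made-up values to show how mean and median can disagree, and how groupby with agg reports both side by side. The numbers are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Made-up values: one very expensive left-footed player dominates the mean.
toy = pd.DataFrame({
    'Preferred Foot': ['Left', 'Left', 'Left', 'Right', 'Right', 'Right'],
    'Value': [110_500_000, 600_000, 500_000, 1_300_000, 1_200_000, 1_000_000],
})

# Mean, median and group size for each preferred foot.
summary = toy.groupby('Preferred Foot')['Value'].agg(['mean', 'median', 'count'])
print(summary)
```

Running the same aggregation on df would show whether the left-foot advantage also holds for the median.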

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • I looked at the relationship between weight and finishing and found no clear correlation: a player's weight does not necessarily affect his finishing.
  • From there I checked whether ball control is related to finishing. The result was a strong positive correlation, which makes sense: to finish well, a player must be able to control the ball.
  • I then investigated dribbling and finishing and, as with the previous observation, found a strong positive correlation.
  • I also checked player value by preferred foot and was surprised to find that left-footed players are valued more on average.

Multivariate Exploration

In [33]:
sns.pairplot(df);

Age, Weight and Body_type of players

In [34]:
g = sns.FacetGrid(data = df, hue = 'Body Type', height = 8, aspect = 1.5, palette = 'viridis_r')
g.map(plt.scatter, 'Weight', 'Age');
g.add_legend();
plt.title('Age vs Weight and Body_type of players');
  • From observation, most young players weigh less and, in line with that, have a lean body type.
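
The scatter plot can be complemented with a per-group summary. The sketch below, on made-up rows, computes the median age and weight for each body type with groupby. The values are hypothetical and only show the shape of the result.

```python
import pandas as pd

# Made-up rows for the three body types present in the data.
toy = pd.DataFrame({
    'Body Type': ['Lean', 'Lean', 'Lean', 'Normal', 'Normal', 'Stocky', 'Stocky'],
    'Age':       [16, 17, 19, 26, 27, 31, 29],
    'Weight':    [176, 154, 134, 150, 154, 159, 190],
})

# Median age and weight per body type.
by_body = toy.groupby('Body Type')[['Age', 'Weight']].median()
print(by_body)
```

On the notebook's data the same call would be df.groupby('Body Type')[['Age', 'Weight']].median().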

Players Body_type, Value and Preferred foot

In [35]:
sns.catplot(x='Body Type',y='Value', hue='Preferred Foot', kind='bar', data=df);
plt.title('Body_type vs Value and Preferred_foot');
  • Among players with a stocky body type, the left-footed ones have the higher average value.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • There is a strong correlation between a player's stamina and his ball control. This suggests that players with good stamina are more likely to control the ball well, and the relationship could be useful when scouting players for a club.
  • Most of the young players have a lean body type and weigh less.

Were there any interesting or surprising interactions between features?

  • It was interesting to find that the goalkeeper reflexes rating (GKReflexes) has a low positive correlation with age.
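
Because GKReflexes is low for every outfield player, a correlation computed over all positions can hide what happens among goalkeepers themselves. The sketch below uses made-up rows to compare the overall correlation with the one restricted to Position == 'GK'. The toy values are assumptions, so the result on the real data may differ.

```python
import pandas as pd

# Made-up rows: goalkeepers have high GKReflexes, outfield players low.
toy = pd.DataFrame({
    'Position':   ['GK', 'GK', 'GK', 'GK', 'ST', 'ST', 'CM', 'CB'],
    'Age':        [19, 24, 27, 33, 22, 31, 26, 29],
    'GKReflexes': [55, 68, 94, 80, 11, 8, 13, 9],
})

# Pearson correlation across all players.
overall = toy['Age'].corr(toy['GKReflexes'])

# Same correlation restricted to goalkeepers.
keepers = toy[toy['Position'] == 'GK']
gk_only = keepers['Age'].corr(keepers['GKReflexes'])

print(f'all players: {overall:.2f}, goalkeepers only: {gk_only:.2f}')
```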